AITopics

Genre: Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.93)
Information Technology > Sensing and Signal Processing > Image Processing (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Neural Information Processing Systems, Feb-16-2026, 06:04:28 GMT

Unleash the Potential of Image Branch for Cross-modal 3D Object Detection

However, existing cross-modal 3D detectors do not fully utilize the image domain information to address the bottleneck issues of the LiDAR-based detectors.

artificial intelligence, computer vision and pattern recognition, machine learning, (13 more...)

Country:

Asia > China > Hong Kong (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.67)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Neural Information Processing Systems, Dec-26-2025, 11:31:38 GMT

Unleash the Potential of Image Branch for Cross-modal 3D Object Detection

To achieve reliable and precise scene understanding, autonomous vehicles typically incorporate multiple sensing modalities to capitalize on their complementary attributes. However, existing cross-modal 3D detectors do not fully utilize the image domain information to address the bottleneck issues of the LiDAR-based detectors. This paper presents a new cross-modal 3D object detector, namely UPIDet, which aims to unleash the potential of the image branch from two aspects. First, UPIDet introduces a new 2D auxiliary task called normalized local coordinate map estimation. This approach enables the learning of local spatial-aware features from the image modality to supplement sparse point clouds. Second, we discover that the representational capability of the point cloud backbone can be enhanced through the gradients backpropagated from the training objectives of the image branch, utilizing a succinct and effective point-to-pixel module.
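The idea of a normalized local coordinate can be illustrated with a minimal sketch. It assumes the target for a point is its position inside an oriented 3D box, rescaled so each axis spans [0, 1]. The function name, the z-up axis convention, and the centered box parameterization are illustrative assumptions and are not taken from the paper, whose exact definition may differ.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def normalized_local_coords(point: Vec3, center: Vec3, size: Vec3, yaw: float) -> Vec3:
    """Map a point to box-relative coordinates, each in [0, 1] for points inside the box.

    Assumptions (hypothetical, for illustration only):
    - `center` is the geometric center of the box,
    - `size` is (length, width, height) along the box's own x, y, z axes,
    - `yaw` is a rotation about the z axis, in radians.
    """
    dx = point[0] - center[0]
    dy = point[1] - center[1]
    dz = point[2] - center[2]
    # Rotate the offset by -yaw to express it in the box's local frame.
    c, s = math.cos(yaw), math.sin(yaw)
    local_x = c * dx + s * dy
    local_y = -s * dx + c * dy
    local_z = dz
    # Shift from [-0.5, 0.5] to [0, 1] after dividing by the box extent.
    return (
        local_x / size[0] + 0.5,
        local_y / size[1] + 0.5,
        local_z / size[2] + 0.5,
    )


# A point at the box center maps to (0.5, 0.5, 0.5).
center_nlc = normalized_local_coords((1.0, 2.0, 0.0), (1.0, 2.0, 0.0), (4.0, 2.0, 2.0), 0.0)
# A box corner maps to (1.0, 1.0, 1.0) when the box is axis-aligned.
corner_nlc = normalized_local_coords((2.0, 1.0, 1.0), (0.0, 0.0, 0.0), (4.0, 2.0, 2.0), 0.0)
```

In a detector, such per-point targets would be projected onto the image plane to supervise the image branch. That projection step is omitted here.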

image branch, name change, object detection, (6 more...)

Technology:

Information Technology > Sensing and Signal Processing (0.60)
Information Technology > Artificial Intelligence > Vision (0.40)

Neural Information Processing Systems, Oct-9-2025, 03:15:21 GMT

a1f0c0cd6caaa4863af5f12608edf63e-Paper-Conference.pdf

artificial intelligence, computer vision and pattern recognition, machine learning, (13 more...)

Country:

Asia > China > Hong Kong (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.67)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

arXiv.org Artificial Intelligence, Aug-1-2025

MultiEditor: Controllable Multimodal Object Editing for Driving Scenarios Using 3D Gaussian Splatting Priors

Lu, Shouyi, Lin, Zihan, Lu, Chao, Wang, Huanran, Zhuo, Guirong, Zheng, Lianqing

Autonomous driving systems rely heavily on multimodal perception data to understand complex environments. However, the long-tailed distribution of real-world data hinders generalization, especially for rare but safety-critical vehicle categories. To address this challenge, we propose MultiEditor, a dual-branch latent diffusion framework designed to jointly edit images and LiDAR point clouds in driving scenarios. At the core of our approach is the introduction of 3D Gaussian Splatting (3DGS) as a structural and appearance prior for target objects. Leveraging this prior, we design a multi-level appearance control mechanism, comprising pixel-level pasting, semantic-level guidance, and multi-branch refinement, to achieve high-fidelity reconstruction across modalities. We further propose a depth-guided deformable cross-modality condition module that adaptively enables mutual guidance between modalities using 3DGS-rendered depth, significantly enhancing cross-modality consistency. Extensive experiments demonstrate that MultiEditor achieves superior performance in visual and geometric fidelity, editing controllability, and cross-modality consistency. Furthermore, generating rare-category vehicle data with MultiEditor substantially enhances the detection accuracy of perception models on underrepresented classes.

artificial intelligence, machine learning, point cloud, (16 more...)

2507.21872

Genre: Research Report (0.64)

Industry: Information Technology (0.89)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

arXiv.org Artificial Intelligence, Mar-21-2025

GAA-TSO: Geometry-Aware Assisted Depth Completion for Transparent and Specular Objects

Liu, Yizhe, Jia, Tong, Cai, Da, Wang, Hao, Chen, Dongyue

Transparent and specular objects are frequently encountered in daily life, factories, and laboratories. However, due to their unique optical properties, the depth information on these objects is usually incomplete and inaccurate, which poses significant challenges for downstream robotics tasks. Therefore, it is crucial to accurately restore the depth information of transparent and specular objects. Previous depth completion methods for these objects usually use RGB information as an additional channel of the depth image to perform depth prediction. Because transparent and specular objects have little texture, methods that rely heavily on color information tend to generate structure-less depth predictions. Moreover, these 2D methods cannot effectively explore the 3D structure hidden in the depth channel, resulting in depth ambiguity. To this end, we propose a geometry-aware assisted depth completion method for transparent and specular objects, which focuses on exploring the 3D structural cues of the scene. Specifically, besides extracting 2D features from RGB-D input, we back-project the input depth to a point cloud and build a 3D branch to extract hierarchical scene-level 3D structural features. To exploit 3D geometric information, we design several gated cross-modal fusion modules to effectively propagate multi-level 3D geometric features to the image branch. In addition, we propose an adaptive correlation aggregation strategy to appropriately assign 3D features to the corresponding 2D features. Extensive experiments on the ClearGrasp, OOD, TransCG, and STD datasets show that our method outperforms other state-of-the-art methods. We further demonstrate that our method significantly enhances the performance of downstream robotic grasping tasks.
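The back-projection step mentioned in the abstract can be sketched with the standard pinhole camera model. This is a minimal illustration, not the authors' implementation: the function name, the list-of-lists depth layout, and the intrinsics `fx`, `fy`, `cx`, `cy` are assumptions introduced here.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]


def back_project_depth(
    depth: List[List[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point3D]:
    """Back-project a depth map into camera-frame 3D points (pinhole model).

    Pixels with non-positive depth are treated as missing and skipped,
    which is a common convention for incomplete depth on transparent surfaces.
    """
    points: List[Point3D] = []
    for v, row in enumerate(depth):      # v: pixel row (image y)
        for u, z in enumerate(row):      # u: pixel column (image x)
            if z <= 0.0:
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Toy 2x2 depth map with one missing reading (0.0); intrinsics are made up.
toy_depth = [[1.0, 2.0], [0.0, 4.0]]
cloud = back_project_depth(toy_depth, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
```

The resulting point list is what a 3D branch would consume. Keeping the pixel index alongside each point would additionally allow 3D features to be assigned back to their 2D locations.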

artificial intelligence, information, machine learning, (18 more...)

2503.17106

Country:

Asia > China > Liaoning Province > Shenyang (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Sensing and Signal Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Neural Information Processing Systems, Jan-19-2025, 17:42:40 GMT

Unleash the Potential of Image Branch for Cross-modal 3D Object Detection

detector, image branch, object detection, (4 more...)

Technology: Information Technology > Artificial Intelligence > Vision (0.74)

Huang, Zhenhan, Pedapati, Tejaswini, Chen, Pin-Yu, Gao, Jianxi

Differentiable Prompt Learning for Vision Language Models

arXiv.org Artificial Intelligence, Dec-31-2024

Prompt learning is an effective way to exploit the potential of large-scale pre-trained foundational models. Continuous prompts parameterize context tokens in prompts by turning them into differentiable vectors. Deep continuous prompts insert prompts not only in the input but also in the intermediate hidden representations. Manually designed deep continuous prompts exhibit a remarkable improvement compared to the zero-shot pre-trained model on downstream tasks. How to automate the continuous prompt design is an underexplored area, and a fundamental question arises: is the manually designed deep prompt strategy optimal? To answer this question, we propose a method dubbed differentiable prompt learning (DPL). The DPL method is formulated as an optimization problem to automatically determine the optimal context length of the prompt to be added to each layer, where the objective is to maximize the performance. We test the DPL method on the pre-trained CLIP. We empirically find that by using only limited data, our DPL method can find a deep continuous prompt configuration with high confidence. The performance on the downstream tasks exhibits the superiority of the automatic design: our method boosts the average test accuracy by 2.60% on 11 datasets compared to baseline methods. Besides, our method focuses only on the prompt configuration (i.e., the context length for each layer), which means that our method is compatible with the baseline methods that have sophisticated designs to boost the performance. The DPL method can be deployed to large language models or computer vision models at no cost.

continuous prompt, large language model, machine learning, (18 more...)

2501.00457

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.69)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)

Park, Seung, Shin, Yong-Goo

Improving GANs with a Feature Cycling Generator

arXiv.org Artificial Intelligence, Feb-17-2023

Generative adversarial networks (GANs), built with a generator and discriminator, have significantly advanced image generation. Typically, existing papers build their generators by stacking multiple residual blocks, since this eases the training of generators. However, some recent papers commented on the limitation of the residual block and proposed a new architectural unit that improves GAN performance. Following this trend, this paper presents a novel unit, called the feature cycling block (FCB), which achieves impressive results in the image generation task. Specifically, the FCB has two branches: one is a memory branch and the other is an image branch. The memory branch keeps meaningful information at each stage of the generator, whereas the image branch takes some useful features from the memory branch to produce a high-quality image. To show the capability of the proposed method, we conducted extensive experiments using various datasets including CIFAR-10, CIFAR-100, FFHQ, AFHQ, and subsets of LSUN. Experimental results demonstrate the substantial superiority of our approach over the baseline without requiring any additional objective functions or training techniques. For instance, the proposed method improves the Fréchet inception distance (FID) of StyleGAN2 from 4.89 to 3.72 on the FFHQ dataset and from 6.64 to 5.57 on the LSUN Bed dataset. We believe that the pioneering attempt presented in this paper could inspire the community with better-designed generator architecture and with training objectives or techniques compatible with the proposed method.

artificial intelligence, dataset, machine learning, (17 more...)

2210.09638

Country: Asia > South Korea > North Chungcheong > Cheongju-si (0.04)

Genre: Research Report > New Finding (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Zhang, Wenwei, Wang, Zhe, Loy, Chen Change

Multi-Modality Cut and Paste for 3D Object Detection

arXiv.org Artificial Intelligence, Dec-23-2020

Three-dimensional (3D) object detection is essential in autonomous driving. There are observations that multi-modality methods based on both point cloud and imagery features perform only marginally better or sometimes worse than approaches that solely use a single-modality point cloud. This paper investigates the reason behind this counter-intuitive phenomenon through a careful comparison between augmentation techniques used by single-modality and multi-modality methods. We found that existing augmentations practiced in single-modality detection are equally useful for multi-modality detection. We then present a new multi-modality augmentation approach, Multi-mOdality Cut and pAste (MoCa). MoCa boosts detection performance by cutting point cloud and imagery patches of ground-truth objects and pasting them into different scenes in a consistent manner while avoiding collision between objects. We also explore beneficial architecture design and optimization practices in implementing a good multi-modality detector. Without using an ensemble of detectors, our multi-modality detector achieves new state-of-the-art performance on the nuScenes dataset and competitive performance on the KITTI 3D benchmark. Our method also wins the best PKL award in the 3rd nuScenes detection challenge. Code and models will be released at https://github.com/open-mmlab/mmdetection3d.
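The collision-avoidance part of cut-and-paste augmentation can be sketched as a simple overlap test before each paste. This is a simplified illustration and not the MoCa implementation: it assumes axis-aligned bird's-eye-view boxes (real 3D boxes are usually rotated), the names `overlaps` and `paste_without_collision` are hypothetical, and the consistent pasting of the matching image patch is omitted.

```python
from typing import List, Tuple

# Axis-aligned bird's-eye-view box: (x_min, y_min, x_max, y_max). Assumed layout.
Box = Tuple[float, float, float, float]


def overlaps(a: Box, b: Box) -> bool:
    """Return True if two axis-aligned boxes share any interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def paste_without_collision(scene: List[Box], candidates: List[Box]) -> List[Box]:
    """Add sampled ground-truth boxes to a scene, skipping any that collide.

    Each candidate is checked against the original scene objects and against
    candidates that were already accepted.
    """
    placed = list(scene)
    for cand in candidates:
        if not any(overlaps(cand, other) for other in placed):
            placed.append(cand)
    return placed


scene_boxes = [(0.0, 0.0, 2.0, 4.0)]
sampled = [
    (1.0, 1.0, 3.0, 5.0),  # collides with the existing scene object
    (5.0, 0.0, 7.0, 4.0),  # free space, accepted
    (6.0, 1.0, 8.0, 5.0),  # collides with the previously accepted box
]
result = paste_without_collision(scene_boxes, sampled)
```

In a multi-modality setting, each accepted box would also trigger pasting of the corresponding points and image patch so that both modalities stay aligned.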

augmentation, detection, point cloud, (15 more...)

2012.12741

Genre: Research Report > New Finding (0.46)

Industry:

Transportation > Ground > Road (0.34)
Information Technology (0.34)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)